Welcome! In this interactive tutorial you will learn how to use data science skills to analyses maternal and child health. Specifically you will be exploring birthweight and factors that can lead to low birthweight using the R statistical software.
Note This section is designed as an interactive tutorial, so get ready to be presented with questions and exercises! The reading time is around 20 minutes and this should give you a gentle introduction to the types of content covered in the Masters of Science in Health Data Science.
Why Study Birthweight?
Birthweight is a crucial indicator of a newborn’s health and well-being. It serves as a fundamental metric in assessing a baby’s initial growth and development. Moreover, birthweight plays a pivotal role in predicting the infant’s short-term and long-term health outcomes. Babies born with low birthweight, typically defined as weighing less than 2,500 grams at birth, face increased risks of complications, developmental issues, and chronic health conditions.
created by Stable Diffusion
Factors Influencing Low Birthweight
There are many factors that may contribute to low birthweight:
Maternal Nutrition: Adequate maternal nutrition is paramount for the proper growth and development of the fetus.
Maternal Health Conditions: Certain maternal health conditions, such as hypertension, diabetes, and infections, can impact fetal growth and contribute to low birthweight.
Lifestyle Factors: Maternal lifestyle choices, including smoking, alcohol consumption, and illicit drug use, have been linked to low birthweight.
Socioeconomic Factors: Limited access to healthcare, education, and resources can contribute to low birthweight.
Multiple Pregnancies: Twins, triplets, or other multiple pregnancies are at a higher risk of low birthweight due to the shared resources in the womb.
Birthweight in context
In New South Wales, data on birthweight are routinely recorded in the Perinatal Data Collection, a population-based surveillance system covering all births in NSW public and private hospitals, as well as home births.
True or False? A boy born weighing 2.2kg would be classified as low birthweight?
Which of the following is not a risk factor for low birthweight?
Data Analysis
Introduction
Great, let’s get started with some data analysis! In order to explore the relationship between birthweight and other factors we are going to need some data. Luckily, the R ecosystem is chockablock with example datasets, including the birthwt dataset which has ten variables collected on 189 mother-baby pairs The data were originally collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
created by Stable Diffusion
The birthwt dataset, is shipped with the MASS R package, so we will load this first by calling library(MASS). We’ll also load the dplyr package, which has helpful tools for data manipulation.
R Packages
R packages are collections of R functions, data, and compiled code bundled together in a standardised format. These packages extend the functionality of base R by providing additional tools for specific tasks or domains. R packages can be created by anyone and are typically shared with the R community through platforms like CRAN (Comprehensive R Archive Network) or GitHub.
dplyr is a popular and highly useful package for data manipulation. Artwork by @allison_horst
Below is an example of the R code you would write to load the required packages and view the first few records for the birthwt dataset.
The library() function loads packages
The select() function selects specific variables
The head() function prints the first six observations
library(MASS) # Includes the birthweight datasetlibrary(dplyr) # Tools for manipulating databirthwt |>select(age, smoke, ht, bwt) |>head()
To get an idea of the data, we have selected the following four variables:
age Mother’s age in years
smoke Smoking status during pregnancy (0=No, 1 = Yes)
ht History of hypertension (0=No, 1 = Yes)
bwt Birthweight in grams.
Let’s look at a statistical summary of each of these variables. We can do this with the summary() function, as below.
birthwt |>select(age, smoke, ht, bwt) |>summary()
age smoke ht bwt
Min. :14.00 Min. :0.0000 Min. :0.00000 Min. : 709
1st Qu.:19.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:2414
Median :23.00 Median :0.0000 Median :0.00000 Median :2977
Mean :23.24 Mean :0.3915 Mean :0.06349 Mean :2945
3rd Qu.:26.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:3487
Max. :45.00 Max. :1.0000 Max. :1.00000 Max. :4990
For example, we can see that in this dataset the median maternal age is 23 years and around 39% of mums smoked during pregnancy.
Test your understanding
True or False? The oldest maternal age recorded was 40 years?
The median birthweight was grams.
Let’s compare the average birthweight among mums who did and did not smoke during pregnancy. One way to do this is to use the group_by() function to group the data by smoking status and then calculate the mean in each group, as below.
# A tibble: 2 × 2
smoke mean
<int> <dbl>
1 0 3056.
2 1 2772.
There is clearly an association between maternal smoking and birthweight. The average birthweight was 3,056 grams for children born to non-smokers compared to 2,772 grams for children born to mums who smoked during pregnancy.
Now it’s your turn to try coding! In this interactive exercise, update the code below to calculate the average birthweight among mums who did and did not have a history of hypertension. Remember, history of hypertension is recorded in the variable named ht.
# A tibble: 2 × 2
ht mean
<int> <dbl>
1 0 2972.
2 1 2537.
Answer the questions below based on the results of your analysis:
True or False? Children born to mothers with a history of hypertension have lower birthweights on average?
The difference in birthweight between children born to mothers with and without hypertension (to the nearest gram) is grams.
Data Visualisaion
Introduction
Imagine being armed with the ability to uncover hidden patterns, communicate complex findings, and make impactful decisions that directly influence health outcomes. That’s precisely what data visualization empowers you to do!
created by Stable Diffusion
More R Packages
R offers diverse data visualisation tools, but ggplot2 stands out for its popularity, providing a versatile and user-friendly approach for creating insightful and polished graphics. This package implements a consistent grammer of graphics allowing many different chart types to be created using a single framework.
In the previous section we confirmed that babies born to mums who smoked during pregnancy had lower average birthweight compared to those that did not smoke. Let’s use data visualisation to compare the full distribution of birthweights by maternal smoking status. Note that you can click on the Code icon to reveal the underlying code.
This type of chart is called a grouped boxplot. There are two boxplots presented, one for the non-smoking group and one for the smoking group. The dark horizontal line at the centre of the box indicates the median birthweight for each group. The upper and lower borders of the boxes indicate the 25th and 75th quartiles and the vertical spikes emerging at either end indicate the full range of the data.
Below is a visualisation of the same data using a density plot. If you compare the code for this plot to the previous one, you will see that very little has changed. This is the power of the ggplot package: one consitent framework can produce many different types of graphs!
Both plots above illustrate the same messgage, babies born to mums who smoked during pregnancy have a lower birthweight on average, compared to babies of non-smokers, although there is a lot of overlap for both groups.
Code
birthwt |>filter(age <=40) |>mutate(smokeCategorical =factor(smoke, labels =c('Non-smoking', 'Smoking') )) |>ggplot(aes(x = age, y = bwt, color = smokeCategorical, fill = smokeCategorical, shape = smokeCategorical)) +geom_point() +geom_smooth(method ='lm') +scale_x_continuous("Maternal age (years)") +scale_y_continuous("Birthweight (grams)", labels = scales::comma) +scale_shape_manual("Smoking status", values =c(21, 22)) +scale_color_manual("Smoking status", values =c('#03d77f', '#fb706a')) +scale_fill_manual("Smoking status", values =lighten(c('#03d77f', '#fb706a'), 0.4)) +labs(title="Birthweight by maternal age and maternal smoking status") +theme_minimal() +theme(legend.position ='top')
Which R package provides tools for data visualisation?
Which statement is most accurate based on the figure above?
Interactive exercise
Now it’s your turn to try coding! In this interactive exercise, update the code below to visualise the distribution of birthweight by history of hypertension. Remember, history of hypertension is recorded in the variable named ht.
Answer the questions below based on the results of your analysis:
The difference in median birthweight between children born to mothers with and without hypertension is approximately grams.
Among children born to mothers with no history of hypertension, % of births weighed 3,500 grams or above.
Statistical Modelling
Introduction
Statistical modeling is the cornerstone of evidence-based decision-making in healthcare. By crunching numbers and analysing patterns, we gain insights into disease trends, treatment effectiveness, and the factors influencing health outcomes. It’s like having a powerful magnifying glass that allows us to zoom in on the crucial details within a vast dataset.
created by Stable Diffusion
Enter the workhorse of statistical modeling – linear regression. At its core, linear regression helps us explore the linear relationship between a dependent variable and one or more independent variables. Suppose we’re examining the impact of factors like maternal age, smoking, and hypertension on birthweight. Linear regression enables us to quantify these relationships, providing us with coefficients that represent the strength and direction of each influence. It’s like having a mathematical roadmap to understand how changes in one variable correlate with changes in another.
Let’s take another look at the grouped scatterplot showing birthweight by maternal smoking status. We can see the distribution of birthweight looks to be higher for babies of non-smoking mums compared to smoking mums.
But can we be sure this is the right interpretation? Is the difference meaningful or could it just be noise in the data or random chance? We can fit a linear regression model to quantify the hypothesised relationship between smoking status and birthweight as well as get a sense of the certainty of the estimate.
Below is the code to fit a linear model in R using the lm() function. Note that we specify the dependent variable bwt, the independent variable smoke, and the dataset birthwt.
model1 <-lm(bwt ~ smoke, data = birthwt)
And here are the model results, presented as a table.
tab_model(model1, digits =0, title ='Birthweight', show.r2 =FALSE)
Birthweight
bwt
Predictors
Estimates
CI
p
(Intercept)
3056
2924 – 3188
<0.001
smoke
-284
-495 – -73
0.009
Observations
189
We can interpret this as follows:
The average birthweight among babies born to non-smokers was 3,056 grams.
The 95% confidence interval (CI) for this estimate ranges from 2,294 grams to 3,188 grams. This is the range of values within which we are 95% confident that the true population coefficient lies. In other words, if you were to conduct the same study multiple times and calculate a 95% confidence interval for the coefficient for non-smokers each time, you would expect the true coefficient to fall within the range 2,294–3,188 grams in 95% of those intervals.
The average birthweight among babies born to smokers was 284 grams less than babies born to non-smokers.
The 95% confidence interval (CI) for this estimate ranges from -495 grams to -73 grams. This is the range of values within which we are 95% confident that the true population coefficient lies. In other words, if you were to conduct the same study multiple times and calculate a 95% confidence interval for the coefficient for smokers each time, you would expect the true coefficient to fall within the range -495—73 grams in 95% of those intervals.
Test your understanding
The estimated coefficient for maternal smoking is grams
True or False? The 95% confidence interval for the coefficient for maternal smoking includes 0
Now it’s your turn to try coding! In this interactive exercise, update the code below to model the relationship between birthweight and history of hypertension. Remember, history of hypertension is recorded in the variable named ht.
Answer the questions below based on the results of your analysis:
The difference in median birthweight between children born to mothers with and without hypertension is approximately grams.
Among children born to mothers with no history of hypertension, % of births weighed 3,500 grams or above.
Next steps
Embarking on a career in Health data Science is a exciting, challenging, and rewarding journey. As your numerical literacy expands, you gain the ability to immerse yourself in the rows of a complex health dataset, much like delving into the chapters of a novel. In this vast narrative of data, each variable and data point becomes a sentence, and through meticulous analysis, you decipher the underlying story. Just as a seasoned reader navigates plot twists and character developments, a proficient data scientist navigates through statistical intricacies, extracting meaningful insights that unfold like a well-crafted narrative. The more adept you become at interpreting these data “chapters”, the richer and more nuanced the story of health trends and patterns becomes. And just like a compelling novel, the journey of analysing data in health science is simultaneously challenging and profoundly rewarding.
I encourage you to dive into the vibrant world of Health Data Science with enthusiasm and curiosity. Remember to check out our second introductory tutorial focusing on Python coding and Machine Learning and read about the student experience at the Centre for Big Data Research in Health.
Footnotes
Ghimire, P.R.; Mooney, J.; Fox, L.; Dubois, L. Smoking Cessation during the Second Half of Pregnancy Prevents Low Birth Weight among Australian Born Babies in Regional New South Wales. Int. J. Environ. Res. Public Health 2021, 18, 3417. https://doi.org/10.3390/ijerph18073417↩︎